Principal Boundary on Riemannian Manifolds
We consider the classification problem and focus on nonlinear methods for
classification on manifolds. For multivariate datasets lying on an embedded
nonlinear Riemannian manifold within the higher-dimensional ambient space, we
aim to acquire a classification boundary for the labeled classes, using the intrinsic metric on the manifold. Motivated by the search for an optimal boundary between the two classes, we propose a novel approach -- the principal boundary.
From the perspective of classification, the principal boundary is defined as an
optimal curve that moves in between the principal flows traced out from two
classes of data, and at any point on the boundary, it maximizes the margin
between the two classes. We estimate the boundary, together with its direction, under the supervision of the two principal flows. We show that the principal
boundary yields the usual decision boundary found by the support vector machine
in the sense that locally, the two boundaries coincide. Some optimality and
convergence properties of the random principal boundary and its population
counterpart are also shown. We illustrate how to find, use and interpret the
principal boundary with an application to real data. (Comment: 31 pages, 10 figures)
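As a rough illustration of the idea only (not the authors' estimator), the sketch below assumes the two principal flows on the unit sphere have already been discretized into matched point sequences and takes the geodesic midpoint of each matched pair as a crude equidistant boundary curve; all names and the example flows are hypothetical, and the actual principal boundary additionally optimizes its direction to maximize the margin.

import numpy as np

def slerp_midpoint(p, q):
    """Geodesic midpoint of two points on the unit sphere."""
    omega = np.arccos(np.clip(np.dot(p, q), -1.0, 1.0))   # angle between p, q
    if omega < 1e-12:
        return p
    m = (np.sin(omega / 2.0) / np.sin(omega)) * (p + q)   # slerp at t = 0.5
    return m / np.linalg.norm(m)

# Hypothetical matched discretizations of the two principal flows.
t = np.linspace(0.0, 1.0, 50)
flow1 = np.stack([np.cos(t), np.sin(t), 0.3 * np.ones_like(t)], axis=1)
flow2 = np.stack([np.cos(t), np.sin(t), -0.3 * np.ones_like(t)], axis=1)
flow1 /= np.linalg.norm(flow1, axis=1, keepdims=True)
flow2 /= np.linalg.norm(flow2, axis=1, keepdims=True)

boundary = np.array([slerp_midpoint(p, q) for p, q in zip(flow1, flow2)])
print(boundary.shape)   # (50, 3): an equidistant curve between the two flows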
Principal Sub-manifolds
We revisit the problem of finding principal components of multivariate datasets that lie on an embedded nonlinear Riemannian manifold within the
higher-dimensional space. Our aim is to extend the geometric interpretation of
PCA, while being able to capture the non-geodesic form of variation in the
data. We introduce the concept of a principal sub-manifold: a manifold passing through the center of the data that, at any point, moves in the direction of the highest curvature within the space spanned by the eigenvectors of the local tangent space PCA. Compared to the recent work in the case where the
sub-manifold is of dimension one (Panaretos, Pham and Yao 2014)--essentially a
curve lying on the manifold attempting to capture the one-dimensional
variation--the current setting is much more general. The principal sub-manifold
is therefore an extension of the principal flow, able to capture the higher-dimensional variation in the data. We show that the principal sub-manifold yields the usual principal components in Euclidean space. By means of examples, we illustrate how to find, use and interpret the principal sub-manifold, with an extension of its use to shape analysis.
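A minimal sketch of the local building block, assuming the manifold is the unit sphere: neighbours of a point x are mapped to the tangent space at x via the log map, and a PCA of these tangent vectors gives the local directions of variation that the principal sub-manifold follows. The function names and the bandwidth h are assumptions, not the paper's code.

import numpy as np

def log_map_sphere(x, y):
    """Log map on the unit sphere: the tangent vector at x pointing to y."""
    c = np.clip(np.dot(x, y), -1.0, 1.0)
    v = y - c * x                               # component orthogonal to x
    nv = np.linalg.norm(v)
    return np.zeros_like(x) if nv < 1e-12 else np.arccos(c) * v / nv

def local_tangent_pca(x, data, h=0.3):
    """PCA of the log-mapped neighbours of x within geodesic radius h."""
    vs = np.array([log_map_sphere(x, y) for y in data])
    vs = vs[np.linalg.norm(vs, axis=1) < h]     # keep only local neighbours
    _, _, vt = np.linalg.svd(vs - vs.mean(axis=0), full_matrices=False)
    return vt                                    # rows: local PC directions

# Example: points scattered around the equator of the sphere.
rng = np.random.default_rng(0)
ang = rng.uniform(0.0, 2.0 * np.pi, 500)
data = np.stack([np.cos(ang), np.sin(ang), 0.1 * rng.standard_normal(500)], axis=1)
data /= np.linalg.norm(data, axis=1, keepdims=True)
print(local_tangent_pca(data[0], data)[0])      # leading local direction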
A statistical approach to the inverse problem in magnetoencephalography
Magnetoencephalography (MEG) is an imaging technique used to measure the
magnetic field outside the human head produced by the electrical activity
inside the brain. The MEG inverse problem, identifying the location of the
electrical sources from the magnetic signal measurements, is ill-posed, that
is, there are an infinite number of mathematically correct solutions. Common
source localization methods assume the source does not vary with time and do
not provide estimates of the variability of the fitted model. Here, we
reformulate the MEG inverse problem by considering time-varying locations for
the sources and their electrical moments, and we model their time evolution
using a state space model. Based on our predictive model, we investigate the
inverse problem by finding the posterior source distribution given the multiple
channels of observations at each time rather than fitting fixed source
parameters. Our new model is more realistic than common models and allows us to estimate how the strength, orientation and position of the sources vary over time. We propose two new Monte Carlo methods based on sequential importance sampling. Unlike the usual MCMC sampling scheme, our new methods work in this situation without the need to tune a high-dimensional transition kernel, which would be very costly. The dimensionality of the unknown parameters is extremely large, and the size of the data is even larger. We use Parallel Virtual Machine (PVM) to speed up the computation. (Comment: Published at http://dx.doi.org/10.1214/14-AOAS716 in the Annals of Applied Statistics, http://www.imstat.org/aoas/, by the Institute of Mathematical Statistics, http://www.imstat.org.)
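For intuition only, here is a minimal bootstrap particle filter (sequential importance sampling with resampling) on a toy one-dimensional linear-Gaussian state space model; the paper's samplers for the MEG model are substantially more elaborate, and the model below is merely a stand-in for the time-varying source parameters.

import numpy as np

rng = np.random.default_rng(0)
T, N = 100, 1000                      # time steps, particles
a, q, r = 0.95, 0.1, 0.5              # AR coefficient, state and obs noise sd

# Simulate hidden states and noisy observations.
x = np.zeros(T)
for t in range(1, T):
    x[t] = a * x[t - 1] + q * rng.standard_normal()
y = x + r * rng.standard_normal(T)

particles = rng.standard_normal(N)
est = np.zeros(T)
for t in range(T):
    particles = a * particles + q * rng.standard_normal(N)  # propagate
    logw = -0.5 * ((y[t] - particles) / r) ** 2             # importance weights
    w = np.exp(logw - logw.max())
    w /= w.sum()
    est[t] = np.sum(w * particles)                          # filtered mean
    particles = rng.choice(particles, size=N, p=w)          # resample

print(np.mean((est - x) ** 2))  # filtering error on the toy model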
Optimal classification in sparse Gaussian graphic model
Consider a two-class classification problem where the number of features is
much larger than the sample size. The features are masked by Gaussian noise
with mean zero and covariance matrix $\Sigma$, where the precision matrix $\Omega = \Sigma^{-1}$ is unknown but is presumably sparse. The useful features, also unknown, are sparse and each contributes weakly (i.e., rare and weak) to the classification decision. By obtaining a reasonably good estimate of $\Omega$, we formulate the setting as a linear regression model. We propose a
two-stage classification method where we first select features by the method of
Innovated Thresholding (IT), and then use the retained features and Fisher's
LDA for classification. In this approach, a crucial problem is how to set the
threshold of IT. We approach this problem by adapting the recent innovation of
Higher Criticism Thresholding (HCT). We find that when useful features are rare
and weak, the limiting behavior of HCT is essentially just as good as the
limiting behavior of the ideal threshold, the threshold one would choose if the
underlying distribution of the signals is known (if only). Somewhat
surprisingly, when $\Omega$ is sufficiently sparse, its off-diagonal coordinates usually do not have a major influence over the classification decision. Compared to recent work in the case where $\Omega$ is the identity
matrix [Proc. Natl. Acad. Sci. USA 105 (2008) 14790-14795; Philos. Trans. R.
Soc. Lond. Ser. A Math. Phys. Eng. Sci. 367 (2009) 4449-4470], the current
setting is much more general, calling for a new approach and much more
sophisticated analysis. One key component of the analysis is the intimate
relationship between HCT and Fisher's separation. Another key component is the
tight large-deviation bounds for empirical processes for data with
unconventional correlation structures, where graph theory on vertex coloring
plays an important role. (Comment: Published at http://dx.doi.org/10.1214/13-AOS1163 in the Annals of Statistics, http://www.imstat.org/aos/, by the Institute of Mathematical Statistics, http://www.imstat.org.)
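A minimal sketch of the HCT step on a vector of feature z-scores; in the paper it is applied after Innovated Thresholding has transformed the data with the estimated precision matrix, and details below such as the search range alpha0 are assumptions for illustration.

import numpy as np
from scipy.stats import norm

def hct_threshold(z, alpha0=0.10):
    """Return the absolute-z threshold maximizing the HC objective."""
    p = len(z)
    pvals = np.sort(2.0 * norm.sf(np.abs(z)))          # two-sided p-values
    i = np.arange(1, p + 1)
    hc = np.sqrt(p) * (i / p - pvals) / np.sqrt(pvals * (1.0 - pvals) + 1e-12)
    k = int(np.argmax(hc[: max(1, int(alpha0 * p))]))  # search smallest p-values
    return np.sort(np.abs(z))[::-1][k]                  # matching |z| threshold

rng = np.random.default_rng(1)
z = rng.standard_normal(10000)
z[:30] += 3.0                                           # rare and weak signals
thr = hct_threshold(z)
selected = np.abs(z) >= thr                             # features kept for LDA
print(thr, int(selected.sum()))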
Fixed Boundary Flows
We consider the fixed boundary flow, which carries the canonical interpretability of principal components extended to non-linear Riemannian manifolds. We aim to find a flow with fixed starting and ending points for multivariate datasets lying on an embedded non-linear Riemannian manifold; unlike the principal flow, which starts from the center of the data cloud, both endpoints are given in advance, and distances are measured with the intrinsic metric on the manifold. From the
perspective of geometry, the fixed boundary flow is defined as an optimal curve moving within the data cloud: at any point, it maximizes the inner product between a locally computed vector field and the tangent vector of the flow. The rigorous
definition is given by means of an Euler-Lagrange problem, and its solution is
reduced to that of a Differential Algebraic Equation (DAE). A high-level algorithm is developed to compute the fixed boundary flow numerically. We show that the fixed boundary flow yields a concatenation of three segments, one of which coincides with the usual principal flow when the manifold reduces to Euclidean space. We illustrate how the fixed boundary flow can be used and interpreted, and demonstrate its application to real data.
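A crude numerical sketch of the defining criterion, not the paper's Euler-Lagrange/DAE solver: a polyline on the unit sphere with both endpoints pinned is adjusted by finite-difference gradient ascent so that its tangents align with a vector field; the field V below is a hypothetical stand-in, not one estimated from data.

import numpy as np

def V(x):
    """Hypothetical vector field: rotation about the z-axis, projected
    onto the tangent space at x."""
    v = np.array([-x[1], x[0], 0.0])
    return v - np.dot(v, x) * x

def objective(pts):
    """Sum of inner products of the field with central-difference tangents."""
    tangents = pts[2:] - pts[:-2]
    return sum(np.dot(V(p), t) for p, t in zip(pts[1:-1], tangents))

def ascent_step(pts, eta=1e-2, eps=1e-5):
    """One projected gradient-ascent step; the two endpoints stay fixed."""
    grad = np.zeros_like(pts)
    for i in range(1, len(pts) - 1):
        for j in range(3):                      # finite-difference gradient
            d = np.zeros(3)
            d[j] = eps
            hi, lo = pts.copy(), pts.copy()
            hi[i] += d
            lo[i] -= d
            grad[i, j] = (objective(hi) - objective(lo)) / (2 * eps)
    pts = pts.copy()
    pts[1:-1] += eta * grad[1:-1]
    return pts / np.linalg.norm(pts, axis=1, keepdims=True)  # back to sphere

# Initial guess: a quarter meridian between the two fixed endpoints.
t = np.linspace(0.0, np.pi / 2, 20)
pts = np.stack([np.sin(t), np.zeros_like(t), np.cos(t)], axis=1)
for _ in range(200):
    pts = ascent_step(pts)
print(objective(pts))   # alignment improves over the initial meridian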
Manifold Fitting under Unbounded Noise
There has been an emerging trend in non-Euclidean dimension reduction of
aiming to recover a low dimensional structure, namely a manifold, underlying
the high dimensional data. Recovering the manifold requires the noise to satisfy certain concentration conditions. Existing methods address this problem by constructing an output manifold based on the tangent space estimated at each sample point. Although theoretical convergence is guaranteed for these methods, the guarantees require either noiseless samples or bounded noise. However, if the noise is
unbounded, which is a common scenario, the tangent space estimation of the
noisy samples will be blurred, thereby breaking the manifold fitting. In this
paper, we introduce a new manifold-fitting method, by which the output manifold
is constructed by directly estimating the tangent spaces at the projected
points on the underlying manifold, rather than at the sample points, to
decrease the error caused by the noise. Our new method provides theoretical convergence guarantees, in terms of an upper bound on the Hausdorff distance between the output and the underlying manifold and a lower bound on the reach of the output manifold, even when the noise is unbounded. Numerical simulations are provided to
validate our theoretical findings and demonstrate the advantages of our method
over other relevant methods. Finally, our method is applied to real data examples.
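A simplified sketch of the key idea, under strong assumptions (a Gaussian-kernel weighted mean as a crude projection toward the manifold; not the paper's estimator): the tangent space is estimated at the projected point rather than at the noisy sample itself.

import numpy as np

def weighted_projection(y, data, h=0.2):
    """Pull y toward the manifold via a Gaussian-kernel weighted mean."""
    w = np.exp(-np.sum((data - y) ** 2, axis=1) / (2 * h ** 2))
    return (w[:, None] * data).sum(axis=0) / w.sum()

def tangent_at_projection(y, data, h=0.2, d=1):
    """Estimate a d-dimensional tangent space at the projected point z,
    not at the noisy sample y itself."""
    z = weighted_projection(y, data, h)
    w = np.exp(-np.sum((data - z) ** 2, axis=1) / (2 * h ** 2))
    cov = ((w[:, None] * (data - z)).T @ (data - z)) / w.sum()
    vals, vecs = np.linalg.eigh(cov)
    return z, vecs[:, -d:]               # top-d eigenvectors span the tangent

# Noisy circle in the plane; tangent estimated at the projection of a sample.
rng = np.random.default_rng(2)
ang = rng.uniform(0.0, 2.0 * np.pi, 2000)
data = np.stack([np.cos(ang), np.sin(ang)], axis=1)
data += 0.05 * rng.standard_normal((2000, 2))
z, tang = tangent_at_projection(data[0], data)
print(z, tang.ravel())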
Estimation of Ridge Using Nonlinear Transformation on Density Function
Ridges play a vital role in accurately approximating the underlying structure
of manifolds. In this paper, we explore the ridge's variation by applying a
concave nonlinear transformation to the density function. Through the
derivation of the Hessian matrix, we observe that nonlinear transformations
yield a rank-one modification of the Hessian matrix. Leveraging the variational
properties of eigenvalue problems, we establish a partial order inclusion
relationship among the corresponding ridges. We further observe that the transformation can lead to improved estimation of the tangent space via this rank-one modification of the Hessian matrix. To validate our theory, we conduct extensive numerical experiments on synthetic and real-world datasets, which demonstrate that the ridges obtained from our transformed approach approximate the underlying true manifold better than other manifold fitting algorithms.
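The rank-one modification can be checked directly for the concave transform t(u) = log u: for g = log p one has Hess g = Hess p / p - (grad p)(grad p)^T / p^2, so the transformed Hessian differs from a rescaling of Hess p by a rank-one term. Below is a small finite-difference verification on a standard 2-D Gaussian density, an illustrative choice rather than an example from the paper.

import numpy as np

def p(x):
    """Standard bivariate Gaussian density."""
    return np.exp(-0.5 * np.dot(x, x)) / (2.0 * np.pi)

def grad_hess(f, x, eps=1e-5):
    """Finite-difference gradient and Hessian of a scalar function."""
    n = len(x)
    g, H = np.zeros(n), np.zeros((n, n))
    for i in range(n):
        e = np.zeros(n); e[i] = eps
        g[i] = (f(x + e) - f(x - e)) / (2 * eps)
        for j in range(n):
            ej = np.zeros(n); ej[j] = eps
            H[i, j] = (f(x + e + ej) - f(x + e - ej)
                       - f(x - e + ej) + f(x - e - ej)) / (4 * eps ** 2)
    return g, H

x = np.array([0.7, -0.3])
gp, Hp = grad_hess(p, x)
_, Hlog = grad_hess(lambda u: np.log(p(u)), x)
# The identity Hess(log p) = Hess(p)/p - (grad p)(grad p)^T / p^2 holds.
print(np.allclose(Hlog, Hp / p(x) - np.outer(gp, gp) / p(x) ** 2, atol=1e-4))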